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Abstract. Recent advances in high-throughput genomics technologies have 
resulted in the sequencing of large numbers of (near) complete genomes. These 
genome sequences are being mined for important functional elements, such as 
genes. They are also being compared and contrasted in order to identify other 
functional sequences, such as those involved in the regulation of genes. In 
cases where DNA sequences from different organisms can be determined to 
have originated from a common ancestor, it is natural to try to infer the an- 
cestral sequences. The reconstruction of ancestral genomes can lead to insights 
about genome evolution, and the origins and diversity of function. There are 
a number of interesting foundational questions associated with reconstructing 
ancestral genomes: Which statistical models for evolution should be used for 
making inferences about ancestral sequences? How should extant genomes be 
compared in order to facilitate ancestral reconstruction? Which portions of 
ancestral genomes can be reconstructed reliably, and what are the limits of 
ancestral reconstruction? We discuss recent progress on some of these ques- 
tions, offer some of our own opinions, and highlight interesting mathematics, 
statistics, and computer science problems. 



1. What is comparative genomics? 

These notes summarize a lecture at a special session of the American Math- 
ematical Society on mathematical biology, during which we discussed the central 
problem of comparative genomics, namely how to reconstruct the ancestral genomes 
that evolved into the present-day extant genomes. This is fundamentally a statis- 
tics problem, because with a few exceptions, it is not possible to sequence the 
genomes of ancestral species, and one can only infer ancestral genomes from the 
multitude of genomes that can be sampled at the present time. The problem is a 
grand scientific challenge that has only begun to be tackled in recent years, now 
that whole genomes are being sequenced for the first time. Our aim is to introduce 
the reader to the statistical (and related mathematical) elements of the methods 
of comparative genomics, while providing a glimpse of the exciting results that are 
emerging from first generation attempts to reconstruct ancestral genomes. Due to 
the complex interdisciplinary scope of the subject, we have been forced to omit a lot 
of detail and many interesting topics, but we hope that the curious mathematical 
reader may find some threads worthy of further exploration. 
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We begin with a concrete example of one sequence from a single genome: the 
16S ribosomal RNA (rRNA) gene from Salmonella typhimurium LT2. This se- 



quence can be downloaded from the NCBI website at http : //www . ncbi . nlm . nih . gov/ 
by searching for the accession number "AE008857" . The sequence is 

1 aattgaagag tttgatcatg gctcagattg aacgctggcg gcaggcctaa cacatgcaag 
61 tcgaacggta acaggaagca gcttgctgct tcgctgacga gtggcggacg ggtgagtaat 
121 gtctgggaaa ctgcctgatg gagggggata actactggaa acggtggcta ataccgcata 
181 acgtcgcaag accaaagagg gggaccttcg ggcctcttgc catcagatgt gcccagatgg 
241 gattagcttg ttggtgaggt aacggctcac caaggcgacg atccctagct ggtctgagag 
301 gatgaccagc cacactggaa ctgagacacg gtccagactc ctacgggagg cagcagtggg 
361 gaatattgca caatgggcgc aagcctgatg cagccatgcc gcgtgtatga agaaggcctt 
421 cgggttgtaa agtactttca gcggggagga aggtgttgtg gttaataacc gcagcaattg 
481 acgttacccg cagaagaagc accggctaac tccgtgccag cagccgcggt aatacggagg 
541 gtgcaagcgt taatcggaat tactgggcgt aaagcgcacg caggcggtct gtcaagtcgg 
601 atgtgaaatc cccgggctca acctgggaac tgcattcgaa actggcaggc ttgagtcttg 
661 tagagggggg tagaattcca ggtgtagcgg tgaaatgcgt agagatctgg aggaataccg 
721 gtggcgaagg cggccccctg gacaaagact gacgctcagg tgcgaaagcg tggggagcaa 
781 acaggattag ataccctggt agtccacgcc gtaaacgatg tctacttgga ggttgtgccc 
841 ttgaggcgtg gcttccggag ctaacgcgtt aagtagaccg cctggggagt acggccgcaa 
901 ggttaaaact caaatgaatt gacgggggcc cgcacaagcg gtggagcatg tggtttaatt 
961 cgatgcaacg cgaagaacct tacctggtct tgacatccac agaactttcc agagatggat 
1021 tggtgccttc gggaactgtg agacaggtgc tgcatggctg tcgtcagctc gtgttgtgaa 
1081 atgttgggtt aagtcccgca acgagcgcaa cccttatcct ttgttgccag cgattaggtc 
1141 gggaactcaa aggagactgc cagtgataaa ctggaggaag gtggggatga cgtcaagtca 
1201 tcatggccct tacgaccagg gctacacacg tgctacaatg gcgcatacaa agagaagcga 
1261 cctcgcgaga gcaagcggac ctcataaagt gcgtcgtagt ccggattgga gtctgcaact 
1321 cgactccatg aagtcggaat cgctagtaat cgtggatcag aatgccacgg tgaatacgtt 
1381 cccgggcctt gtacacaccg cccgtcacac catgggagtg ggttgcaaaa gaagtaggta 
1441 gcttaacctt cgggagggcg cttaccactt tgtgattcat gactggggtg aagtcgtaac 
1501 aaggtaaccg taggggaacc tgcggttgga tcacctcctt acct 

This gene is a key component of the protein synthesis machinery in cells. In 
fact, it is such a basic ingredient of life, that the same sequence, albeit with minor 
modifications, exists in the genomes of all living organisms. For example, the reader 
is encouraged to compare it to the 16S rRNA gene in Escherichia coli (K12): 

1 aaattgaaga gtttgatcat ggctcagatt gaacgctggc ggcaggccta acacatgcaa 
61 gtcgaacggt aacaggaaga agcttgcttc tttgctgacg agtggcggac gggtgagtaa 
121 tgtctgggaa actgcctgat ggagggggat aactactgga aacggtagct aataccgcat 
181 aacgtcgcaa gaccaaagag ggggaccttc gggcctcttg ccatcggatg tgcccagatg 
241 ggattagcta gtaggtgggg taacggctca cctaggcgac gatccctagc tggtctgaga 
301 ggatgaccag ccacactgga actgagacac ggtccagact cctacgggag gcagcagtgg 
361 ggaatattgc acaatgggcg caagcctgat gcagccatgc cgcgtgtatg aagaaggcct 
421 tcgggttgta aagtactttc agcggggagg aagggagtaa agttaatacc tttgctcatt 
481 gacgttaccc gcagaagaag caccggctaa ctccgtgcca gcagccgcgg taatacggag 
541 ggtgcaagcg ttaatcggaa ttactgggcg taaagcgcac gcaggcggtt tgttaagtca 
601 gatgtgaaat ccccgggctc aacctgggaa ctgcatctga tactggcaag cttgagtctc 
661 gtagaggggg gtagaattcc aggtgtagcg gtgaaatgcg tagagatctg gaggaatacc 
721 ggtggcgaag gcggccccct ggacgaagac tgacgctcag gtgcgaaagc gtggggagca 
781 aacaggatta gataccctgg tagtccacgc cgtaaacgat gtcgacttgg aggttgtgcc 
841 cttgaggcgt ggcttccgga gctaacgcgt taagtcgacc gcctggggag tacggccgca 
901 aggttaaaac tcaaatgaat tgacgggggc ccgcacaagc ggtggagcat gtggtttaat 
961 tcgatgcaac gcgaagaacc ttacctggtc ttgacatcca cagaactttc cagagatgga 
1021 ttggtgcctt cgggaactgt gagacaggtg ctgcatggct gtcgtcagct cgtgttgtga 
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1081 aatgttgggt taagtcccgc aacgagcgca acccttatct tttgttgcca gcggtccggc 
1141 cgggaactca aaggagactg ccagtgataa actggaggaa ggtggggatg acgtcaagtc 
1201 atcatggccc ttacgaccag ggctacacac gtgctacaat ggcgcataca aagagaagcg 
1261 acctcgcgag agcaagcgga cctcataaag tgcgtcgtag tccggattgg agtctgcaac 
1321 tcgactccat gaagtcggaa tcgctagtaa tcgtggatca gaatgccacg gtgaatacgt 
1381 tcccgggcct tgtacacacc gcccgtcaca ccatgggagt gggttgcaaa agaagtaggt 
1441 agcttaacct tcgggagggc gcttaccact ttgtgattca tgactggggt gaagtcgtaa 
1501 caaggtaacc gtaggggaac ctgcggttgg atcacctcct ta 

There are very few differences between the genes. The Salmonella sequence 
has three extra bases at the end, and the E-coli gene one extra in the beginning, 
there are 37 differences within the genes, and no insertions or deletions. Such a 
comparison is easy to perform in the example above, but if there are many se- 
quences and lots of insertions and deletions, it can be non-trivial to identify the 
relationships among individual bases. This multiple sequence alignment problem is 
a major problem in comparative genomics, and is discussed in Section 3. In fact, 
one of our main points is that finding a good multiple alignment is the essence of 
the ancestral reconstruction problem. 

The degree of conservation of the 16S rRNA gene throughout the tree of life 
means it is a good starting point for reconstructing ancestral genomes. In particular, 
we can begin modestly by asking only for the ancestral 16S rRNA gene sequences. 
Such a reconstruction entails the following: 

(1) Identifying the 16S gene in many organisms, and determining the se- 
quences. 

(2) Obtaining a likely phylogenetic tree that relates the sequences. 

(3) Inferring ancestral sequences corresponding to internal nodes in the tree. 



This comparative genomics programme was outlined by Woese et al. |58j . It fol- 
lowed on the heels of the first comparative genomics papers, written by Linus Paul- 
ing and Emile Zuckerkandl [41L 1611 162) . They applied fingerprinting techniques 
to compare amino acid sequences of hemoglobins, finding that distant species have 
more divergent sequences than related species. The biological problem of identify- 
ing the 16S gene and rapidly finding its sequence was solved in [30] . More recently, 
new approaches have been suggested for obtaining 16S rRNA sequences, even from 
unculturable bacteria, using "community sequencing" approaches [llj . 

The narrow focus of comparative genomics on 16S rRNA ended with the ar- 
rival of fast and cheap sequencing technologies. A recent flood of ideas and re- 
search inspired by vast amounts of genome sequences has led to the reconstruction 
of numerous protein sequences and even megabases of boreoeutheriean ancestral 
chromosomes. Interesting examples of the former are |24L 1281 154] and of the latter 
[H 132] . There have been a number of recent surveys on ancestral reconstruction 
[451 153) , and a book on the topic will appear next year [31j . 

We begin in Section 2 by discussing the "easy" case of ancestral sequence 
reconstruction, where the phylogenetic tree is known and the alignment of the 
sequences is trivial. Even in this simplest case, the choice of an effective statistical 
model for evolution is non-trivial and extremely important. We introduce the reader 
to these issues by way of the simplest example possible, and provide pointers for 
further reading. In Section 3 we discuss complexities that arise when inferences 
need to be made about insertions and deletions, and when the alignment of the 
sequences is non-trivial. The difficulty of alignment is explored in more detail in 
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Section 4, where we present evidence, based on our own recent work, that suggests 
the amount of insertion and deletion in genomes has been vastly underestimated. 
This has major implications for ancestral genome reconstruction. In Section 5 we 
discuss the problem of tree reconstruction, which needs to be solved in the case 
where the phylogenetic history of the genomes being compared is unknown. This 
leads us to the field of phylogenetics [49] , where we restrict ourselves to mentioning a 
number of recent theoretical advances pertinent to ancestral genome reconstruction. 
We conclude in Section 6 with a list of open problems and a discussion of the role of 
mathematics, statistics, and computer science in reconstructing ancestral genomes. 

2. Reconstructing ancestral sequences: the "easy" case 

In the introduction, we mentioned the example of 16S rRNA sequences, and ob- 
served that these genes are conserved in all organisms. However, within restricted 
domains of the tree of life, there are examples of functional elements exhibiting 
even more sequence conservation than rRNA genes. The term ultra- conserved ele- 
ments was introduced in [3] , and is used to described genome sequences that have 
remained unchanged over millions of years. Such sequences were first discovered in 
vertebrates, and their degree of conservation is astounding. 

Example 2.1. Consider the sequence 

(2.1) tttaattgaaagaagttaattgaatgaaaatgatcaactaag 

It is located in the human genome on chromosome 7, coordinates 156,694,482- 
156,694,523 (version March 2006). The identical sequence appears in the genomes of 
every other sequenced vertebrate species to date: the chimpanzee, rhesus macaque, 
cat, dog, cow, mouse, rat, rabbit, armadillo, opossum, chicken, frog, zebrafish, 
pufferfish and fugufish. An alignment of the sequence (including an extra 6 bases 
on each end) is shown below: 

Human tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Chimp tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Macaque tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Cow tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Mouse tatctgtttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Rat tatctgtttaattgaaagaagttaattgaatgaaaatgatcaactaagtttgta 

Rabbit tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Cat tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Dog tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Armadillo tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Opossum tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Chicken tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Frog tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Pufferfish tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Zebrafish tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

Fugufish tatctatttaattgaaagaagttaattgaatgaaaatgatcaactaagcttgta 

We postpone providing a precise definition of alignment until Section 3, but 
remark that for our purposes here it suffices to consider an alignment to be an 
ordered collection of columns. Each column contains bases from different species 
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that are homologous, i.e., that are derived from a shared common ancestral base. 
The sequence in (|2.1|) is called ultra-conserved, because there appear to have been no 
insertions, deletions or mutations since the common ancestor (* characters indicate 
columns with all characters identical) among each group of homologous nucleotides. 
In fact, in [40] we prove 

Theorem 2.2. The probability that the sequence 12.1]) was not present in the 
genome of the ancestor of all vertebrates is less than 10 -50 , assuming a Jukes- 
Cantor model of evolution for the sequences. 

The Jukes-Cantor model is a statistical model for the evolution of characters 
on trees, which is explained below. While the model has many drawbacks and does 
not describe the full extent and structure of mutation, the tiny probability is robust 
to changes in the model, and it is fairly certain that (|2.1j) is the ancestral sequence. 

The starting point for specifying an evolutionary model for biological sequences 
is the data: a collection of k sequences a 1 , . . . , a k of lengths n±, . . . , n^, each with 
characters from a finite alphabet S. We use the notation af to denote the i th 
element of a sequence and by a set of characters S = {a 1 , . . . , o~ k } we mean the 
set of m + ri2 • • • + rife sequence characters that form the sequences a 1 , . . . , a k . 
These sequences may be DNA sequences (in which case the alphabet has size 4), 
amino acid sequences (in which case the alphabet has size 20), or they may be 
sequences derived from organisms. For example, the alphabet may have size 2, 
and the sequences may represent the presence or lack of genes, or morphological 
features in different species. The elements of the sequences are called characters, 
and it is important to note that in problems of interest, k is small but n is usually 
fairly large (in the case of DNA sequences, n may be in the billions). 

We restrict our discussion here to the evolutionary model called the Cavender- 
Farris model |10j . This is the simplest of a class of continuous time Markov models 
for trees that are used in biological sequence analysis, and although it is very 
simple, it captures many of the elements of the evolutionary models that are used in 
practice. The Cavender-Farris model is an evolutionary model for binary sequences 
(i.e., the alphabet has size 2), and for sequences of the same length (m = • • • = 
n k = n). 

The first ingredient of the Cavender-Farris model is a directed rooted tree on 
k leaves. The root of the tree has biological significance: it is the common ancestor 
of all the sequences. The leaves of the tree also have special significance: they 
correspond to the k different sequences whose evolution is being modeled. The 
internal vertices of the tree correspond to speciation events. Edges of the tree are 
directed from the root to the leaves, and the directions of the arrows specify the 
direction of time. 

The second ingredient of the Cavender-Farris model is a collection of 2 x 2 
matrices associated with each edge of the tree. These matrices are all of the form 



where i is an edge of the tree, < tt, < 1, and /Ltj = 1 — m. These matrices have 
a biological interpretation: they describe the probabilities of characters changing 
along branches of the tree. Although in principle these probabilities may be different 
for each edge and for each character within each sequence, the Cavender-Farris 
model specifies a single probability of change Wi for each edge i. If the vertices 
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adjacent to the edge i are Vi and w% with i oriented from vt to Wi, then for the jth 
character of the sequence at Vi, iTi is the probability that the jth character of the 
sequence at Wi is different. 

The Cavender-Farris model is an example of a continuous time Markov chain 
model on a tree. To see the connection to Markov models, consider the 2x2 square 
matrix 

(2.3) Q = (~ a °L>0. 

\ a —a J 

The rows and columns of Q are indexed by E = {0, 1}. Note that the matrix Q 
has the following properties: 

qij > for i j, 
qij = for all i £ E, 

ies 

qu < for all i € E. 

Any square matrix Q (of arbitrary size) with the above properties is called a rate 
matrix. The following is a straightforward result about continuous time Markov 
chains [39 . 

Theorem 2.3. Let Q be any rate matrix and 9(t) = e Qt = YnLo V.Q 1 ^ ■ Then 

(1) 9(s + t) = 9(s) ■ 9(t) (Chapman-Kolmogorov equations), 

(2) 9(t) is the unique solution to the forward differential equation 
9'(t) = 9(t) ■ Q, 9(0) = 1 for t > (here 1 is the identity matrix), 

(3) 9(t) is the unique solution to the backward differential equation 
9'{t) = Q ■ 9{t), 9(0) = 1 for t > 0, 

(4) 6>( fe )(0) = Q k . 

Furthermore, a matrix Q is a rate matrix if and only if the matrix 9(t) = e®* is a 
stochastic matrix (non-negative with row sums equal to one) for every t > 0. 

Note that for the binary rate matrix (|2.3p . we have 

m ~ 2 VI - e- 2aU 1 + e- 2aU J ■ 
The expected number of mutations over time t is the quantity 

(2.4) at = -- -trace(Q) -t = -- ■ logdet(0(i)). 

This number is called the branch length. It can be computed from the substitution 
matrix 9(t) and is the expected number of mutations. For this reason it is used 
instead of t to label edges in a phylogenetic tree. 

The matrices 9i in the Cavender-Farris model are parameterized by a and t: 

(2-5) ^ = \(l + e-^% ^ = I(i_e- 2 «*). 

One way to specify an evolutionary model is to give a phylogenetic tree T 
together with Q and an initial distribution for the root of T (which we here assume 
to be the uniform distribution on E). The branch lengths of the edges are unknown 
parameters, and the objective is to estimate these branch lengths from data. Thus, 
if the tree T has r edges, then such a model has r free parameters. 
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Returning to Theorem 12. 2\ we note that the Jukes-Cantor model is just the 
Cavender-Farris model with |S| = 4. That is, the Q matrix is given by 



(2.6) 



Q 



(- 
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a \ 




a 


—3a 
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a 
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—3a 


a 
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a 


-3a/ 



, a > 0. 



In this case the branch length is given by Sat (this should be compared with (2.4)). 

The Cavender-Farris/ Jukes-Cantor models are too simple to be used in prac- 
tice. Point mutations in genome display various asymmetries, and the general 
reversible Markov model is preferred. The model in (2.7,2.8) is realistic, and was 
estimated from observed synonymous substitutions (those that do not change the 
amino acid) in human-mouse-rat alignments [60] . Note that 7r a ,7Tc,7r<3 and itt are 
the equilibrium frequencies, and are also parameters in the model. 



(2.7) 
(2.8) 



Q 



/-1.05 0.19 0.71 0.15 ^ 

0.17 -0.96 0.18 0.61 

0.60 0.17 -0.95 0.17 

\ 0.15 0.72 0.20 -1.07/ 

7TA = 71"T = 0.23, 7T(3 = ttc — 0.27. 



The models used can be even more general, including local dependencies between 
sites and different functional categories for the ancestral sequences that alter mu- 
tation rates. There is an extensive literature on this topic, as well as many papers 
discussing the reliability of reconstructed characters in aligned sequences (e.g. |43J). 

The final aspect of Theorem 12.21 that we have not yet discussed is a computa- 
tional one, namely how to compute the probability given the sequences, the tree 
and the model. The algorithm is known as Felsenstein's algorithm |20j and in- 
volves dynamic programming on a tree. In the context of ancestral reconstruction, 
[44) show how to modify Felsenstein's algorithm for fast joint reconstruction of all 
ancestral sequences. It is best to view all these algorithms as a special case of the 
Junction Tree algorithm for a graphical model where the graph is a tree. 

We conclude by noting that there is a direct connection between evolution- 
ary models such as the Cavender-Farris model and the emerging field of algebraic 
statistics. This is because the families of probability distributions on the leaves 
are essentially parameterized algebraic varieties, and for this reason the tools of 
commutative algebra and algebraic geometry can be used to study the model and 
develop inference methods. We refer the interested reader to the recent book 
for an introduction to the subject. 



3. Alignment 

The models for ancestral reconstruction in Section 2 do not account for in- 
sertions and deletions (indels). This is a serious drawback because the homology 
between muliple sequences is complicated by insertions, deletions, rearrangements, 
segmental duplications, and other evolutionary events. We restrict our discussion 
in this section to the issues regarding ancestral reconstruction in the presence of 
insertions and deletions only. 
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Definition 3.1. A partial global multiple alignment of sequence characters S = 
{cr 1 , . . . ,a k } is a partially ordered set P = {ci, . . . , c m } together with a surjective 
function ip : S — > P such that <p{crf) < <p[o~j) if a < J. 

The elements of P correspond to columns of the multiple alignment, and the 
partial order specifies the order in which columns must appear. Note that the func- 
tion ip specifies the homology relationships among the sequences: two characters 
that are mapped to the same element in the poset are homologous. Thus, an align- 
ment is just a pair (P,ip). We call P an alignment poset, and note that unless P 
is a total order (chain) , there are columns of the partial multiple alignment whose 
order is unspecified. A linear extension of a partially ordered set P = {ci, . . . , c m } 



--NGYE 

--SYYS 

ELIGK-P-Q 
SL KQ 

1 2 3 4 5 6 7 8 9 



Figure 1. A set of four sequences, an alignment poset together 
with a linear extension, and a global multiple alignment. The 
function from the set of sequence elements to the alignment poset 
that specifies the multiple alignment is not shown, but is fully 
specified by the diagram on the right. For example, the second 
element in the first sequence is a\ — G, and <p(o-\) corresponds to 
the fourth column of the multiple alignment. 

is a permutation of the elements ci, . . . ,c m such that whenever a < Cj, i < j. 
A global multiple alignment is a partial global multiple alignment together with a 
linear extension of the alignment poset P (see Figure [lj. The alignment in example 
12. H is a global multiple alignment where the poset is a chain. 

The problem of finding a multiple alignment for a collection of sequences is 
to find a "good" poset and function ip that are likely to determine the homology 
correctly. This is a non-trivial problem, and its solution is essential for accurate an- 
cestral reconstruction of sequences. A complete discussion of multiple alignment is 
beyond the scope of this section, but we briefly review pairwise sequence alignment. 

In the case of two sequences, an alignment poset is always a disjoint union of 
chains. Note that for two sequences of length n and m, there are (™„ ) alignments. 
We consider a simple criterion for selecting an alignment. Given three parameters 
X, S and M, we assign a score to an alignment as follows: 

(3.1) score (P,<p) = M ■ #M + X ■ #A + S ■ #5, 

where #5 = \{x G P : | c,c 1 (a?) | = 1}| (the number of spaces in the alignment), 
#M = \{a G a l ,b G : (fi(a) = (p(b), chara = char6}| (the number of matches), 
and #A = \{a G a 1 ,6 G a : f{a) — tp(b), chara ^ char6}| (the number of mis- 
matches). The problem of maximizing (|3.1j) can be solved efficiently using dynamic 
programming, using what is known as the Needleman-Wunsch algorithm |36j . As 
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with Fclscnstcin's algorithm, this is just a special case of the Junction Tree algo- 
rithm, in this case for a graphical model known as the pair hidden Markov model. 
Despite the name, pair hidden Markov models are non-trivial modifications of stan- 
dard hidden Markov models models, in that the structure of the models is not fixed. 
In the graphical model literature, these types of models are referred to as Bayesian 
multinets. The connection between alignments and pair HMMs provides a proba- 
bilistic interpretation for the score in (|3.ip . For details see pQ or [39j . 

An alignment specifies the homology relationships among nucleotides or amino 
acids, but provides no information about ancestral sequences. In order to make 
inferences about ancestral sequences, it is necessary to use the alignment to estimate 
the locations, sizes, and times of insertions and deletions. 




TTCTCATTGCTACCTGACCATCTGTM^ATGGCTGT1|--UCAATCCACTCAGGCTGTT- 
nCTCAnGCTACCTGACCATCTGTAGAATGGCTGTl - ^CAATCCACTCAGGCTGTT- 
TTCTCATTGCTACCTGACCATCTGTAGAATGGATGT1 - WAATCTACTCAGGCTGTT- 
TTCTCATTGCTACCTGACCATCTGTAGAATGGATGn - MjAATCTACTCAGGCTGTT- 
TTCTCATTGCTACCT&ACCATCTGTAGAATGGATGTT - WjAATCTACTCAGGCTGTT- 
TTCTCATTGCTACCTGACAATCTGTAGAACAGAAGT7 - ^CAATCCATTCAGGTTGTT- 



-TC 
-TC 
-TC 
-TC 
-TC 
-TC 



TTCTCATTGCTACCTGACGATCTGTAGAATAGAAGTT - kCAATCCATTCAGGTTGTT TC 

TTCGCATTGCTACCTGACGATCTGTAGAATAGAAGT^--|^CAATCCATTCAGGTTGTT TC 

GCTATATCACCATTGTnCTTAGAGCTGAATAGTACTC^ATGGTATACATATACCACATTTTATTAATCCATTCTTGGAT 
TTCTCATTGCTCCCTGACAATCTACACCATC^TGCCTCAAGATTTAGTCAGTCCCCT GTTCATGA GCAGTTAGGTTGTT 
GTAAMTTAATATmAAAATCATTTTTACCAM^TfjG^ 



GTAAMTTGATATTTTATAAnATTTTAAGCAGGKTAGAGAGTTGGCTCAGTGGGTA 



fcGAGTGCTTGCTCC 



CTAAAGAAAAMTTTnCnATnCATTTTTTATTAGTTTTTATGTGTAAAACATCAA! ======= jGATGTAATATTATT 

TTTTCATTGC^^TGATATTTnTAGAATGGGTATTTCCAMTTC^TTAATCATACGCTTCTTCCTAAGCATTCAGGT 
TTTTCATTGi — LCTAACATTCTGTAAAACGGGTGTTTAAAATTTCAnTAATCAnCCCCTCTTCGTGGGCAGGJUWSGT 
TCTTCATTT[-^TTAATATTCTATAGAATGGGTATTTCMAAGTCACTTACTTGTTC-CCTCTTCCTGAGCATTCAAAT 
ACAC>ATTCCTCACTGnCTTTATCGATGCGTAGTATTCCAATGTGTC'M^ 



FIGURE 2. Alignment of 17 species from the CFTR region [50] . 



Example 3.2. Figure 2 shows an alignment of 17 species from the cystic fibrosis 
transmembrane conductance regulatory region (CFTR) produced for the ENCODE 
consortium [18] , The phylogenetic tree on the left can be used in conjunction with 
the alignment to identify indel events. The boxes shown in the figure show the most 
parsimonious explanation (minimum number of events), based on the tree [50] . 

Given a phylogenetic tree T, together with a pair hidden Markov model asso- 
ciated to each edge of T, the tree alignment problem is to find sequences a 1 where 
i ranges over all the internal nodes of T such that IIe=(i j)eE(T) P{ a% > IS maxi- 
mized. The inference performed in Example 3.2 [50] is based on a restricted version 
of tree alignment. In general, there is a ratio- two approximation algorithm for the 
problem that runs in quadratic time, together with a polynomial time approxima- 
tion scheme [56] . 

4. The limits of ancestral reconstruction: indel saturation 

Studies of the reliability of ancestral reconstruction have mostly focused on 
point mutations and the effects of using different types of evolutionary models 
[57j . Such studies implicitly assume that multiple alignment, while difficult, is 
a tractable problem and that indels can be effectively ignored, or else accounted 
for using procedures such as tree alignment. In this section we provide indirect 
evidence that even over relatively short time-scales, large number of insertions and 
deletions make it impossible to align sequences (as there is little or no homology), 
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and therefore it is impossible to reconstruct ancestral sequences. Our arguments 
are based on estimating the overall numbers of indels from length differences of 
homologous segments. 

We illustrate our ideas with a calculation based on the comparison of introns in 
four species of the Drosophila (fruit fly): melanogaster, pseudoobscura, yakuba and 
virils. Introns are sequences within genes that are spliced out after transcription 
but before translation, so that they do not contribute to the amino acids in the 
protein. Although it has been shown that introns can be important for regulating 
the expression of genes, they consist of mostly non-functional sequence. The reason 
to examine Drosophila is that multiple genomes have been sequenced during the 
past year. The reason to look at introns is that it is easy to identify homologous 
introns when the genes they reside in are unambiguously homologous. 

Our methods provide an overall assessment of the number of inserted and 
deleted nucleotides along each branch of the tree relating the species. The inference 
is based on a maximum likelihood model for Brownian motion on a tree, originally 
developed by Felsenstein |21j for modeling changes in gene frequencies over time. 
Our application of this model to estimating indel rates from length differences of 
introns is new, and is the first such analysis with more than a pair of species. Pre- 
vious work on the comparison of lengths of pairs of homologous introns |37L 159] 
has mainly revealed that there is a correlation between the length differences and 
the distance between species. 

We explain the Gaussian model we consider completely but briefly, for more 
detail and background we refer the reader to [21] 138] . Let T be an unrooted tree 
with n leaves, together with labels v = Vi, . . . , «2n-3 on the edges parameterizing 
variances of normal distributions. Let xfj be quantitative characters generated 
from a Brownian motion model on a tree. We use the term quantitative character 
to refer to uncountable state spaces, in this case xfj G R. The ijth contrast is 
Aa;f„- = x s — X s . Note that Aa;?„- is drawn from a normal distribution with mean 

and variance Vij. Let C = . . . , i n -xj n -i} C (2 ) be a spanning tree of the 

complete graph K n . In the case of four taxa which we will consider (see Figure 3), 
the C-covariance matrix for C = {12, 13, 14} is 

(Vl5 + V 25 U15 V 15 \ 

VIS Vis + V36 + V 5 6 Vis + ^56 

V15 V15 + V 56 V15 + V 46 + V 56 ) 

We let Axp = [Axf l j l , . . . , Ax s n _ 1 ,- n _J and denote the determinant of Sc by 
|Ec|- The log- likelihood of the data is given by 

(4.2) lnLtAx^Jv) = -^k]n2n \ln |S C | + \ £ ((Ax^fl^ 1 Ax^) . 

" s— 1 

This specifies the model we will use completely. The next Lemma follows from the 
"Pulley principle" [2T] , 

Lemma 4.1. The log-likelihood j4-2\ ) does not depend on the choice ofC. Fur- 
thermore, \4 -Sty is linear in the (IJ) numbers dij — X)s=i \^ x ij) ■ 

In other words, if we let d = {<^ij}™, = ij then the maximum likelihood estimator 

(4.3) v2 — argmax v lnL(d|v). 
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Figure 3. Tree with four leaves. 

is well-defined and it therefore makes sense to refer to the "log-likelihood function 
for the contrast Brownian motion model on a tree" . 

Example 4.2 (n=3, from [21]). The likelihood equation becomes 
bxL{d 12 ,di 3j ,d23\vi4,,V24,vu) = -pln27r- -In (t>i 4 f24 + ^14^34 + v 2 ±v 34 ) 

V 14^23 + ^24^13 + ^34^12 
2(^14^24 + W14W34 + W24V34) ' 

The critical equations are easy to solve and one finds that 

Via = {di2 + d 13 - d 23 )/(2p), 
V24 = {d 23 + di2 - d 13 )/(2p), 
V34, = (dx3 + d 2 3-di 2 )/(2p). 

Note that in the case p = 1, if d is a tree metric then Vij — dij. This is true for 
general n, and is a restatement of the fact that the maximum likelihood estimator 
is consistent. 

Example 4.3 (n=4). There are five critical equations in the n — 4 case: 

«25(«36 + Vm) + V 56 (v 3 6 + V 46 ) = (^34(^25 + V 56 ) + d 24 V 35 + ^23«45) /(2p), 

vi5(v 3 6 + vm) + v 5e (v 36 + v 46 ) = (d 34 (vi 5 + v 56 ) + di 3 v i6 + d M v 36 ) /(2p), 

V46(v 3 6 + V4,e) + V 56 (v 14: + V 24 ) = (^12(^46 + v 56 ) + d 14 v 2b + d 2i v 1 r ) /(2p), 

v 36 {v 36 + v 46 ) + v 56 (v 14 + v 24 ) = (d 12 (v 3e + v 56 ) + d 23 v 15 + d 13 v 25 ) /(2p), 

(fi5 + v 25 ){v 3 6 + v 46 ) = (d 34 (vi 5 + v 25 ) + di 2 (v 36 + w 46 )) /{2p). 

The solution of these equations is an exercise in elimination, and can be done using 
Grobner bases methods [39j . Using the pulley principle we can restrict ourselves 
to V15 + V25 = di 2 and + {Qj = d 34 , in which case we find one unique critical 
point: the global maximum. The solution consists of 5 enormous rational functions 
in the dij which we omit here due to lack of space. 

The data analyzed was obtained from [59] . We restricted our attention to 
introns in the range of 100 — 500bp, and cleaned up the dataset by removing du- 
plicates or cases with ambiguous homology. We then computed, for each pair of 
species, the "distance" 

(4.4) d i0 >jA,-; ; r. 

s=l 
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Figure 4. Differences in lengths between homologous introns in 
Drosophila melanogaster and pseudoobscura. Introns with similar 
numbers of insertions and deletions display little length difference, 
but that docs not mean that introns with small length differences 
have not undergone insertion and deletion. 

where Aa;^ is the difference in length between the sth pair of introns in species i 
and j. This is just the quantity that appears in Lemma |4. II It is important to note 
that the use of the Brownian motion model on the dij constitutes an approximation 
to a Poisson process model for indels. We omit the details of this relationship and 
again refer to [38] . 

Returning to the data, we show a histogram of the differences between the 
intron lengths for a pair of species in Figure 4. These are the raw data we use 
to compute the distances, and then the maximum likelihood estimates for the 
branch lengths, which are the total number of indels. The maximum likelihood 
estimates are 

VdmeLb = 3204, 
Vdyak.,5 = 2521, 
Vdpsefi = 9595, 

v d vir,6 = 49894, 
v 56 = 16169, 

where vertices 5,6 are the internal vertices in the four taxa tree (as in Figure 3). 
In other words, we estimate that the total number of inserted and deleted bases 
between D. melanogaster and D. Yakuba is 5725, and between D. melanogaster and 
D. pseudoobscura we obtain 3204 + 16169 + 9595 = 28968. 

These numbers are much larger than the mean length of the introns we con- 
sidered (note that the longest introns had length 500). The conclusion is that the 
total number of inserted and deleted bases is far larger than the size of the intron. 
This may seem surprising at first, but is a reflection of the fact that insertions and 
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deletions cancel each other out, and therefore a small difference in intron length 
does not necessarily indicate a lack of indel activity. In fact, the intuition that 
homologous introns of similar length must contain homologous nucleotides is false. 
The difference of two Poisson distributions is Skellam distributed, and if the rates 
are the same then the peak is at 0, exactly what we see in Figure 4. 

The Brownian motion model we have proposed is (too) simple; it is the indel 
equivalent of the Jukes-Cantor model for point mutations, but it is sufficiently 
realistic to suggest that it may be impossible to reconstruct ancestral intron due to 
excessive indel turnover. The large amount of insertion and deletion should not be 
surprising in the light of the existence of transposable elements. These are repetitive 
elements that make up a large fraction of many genomes. The term transposable 
elements groups several subclasses of elements that replicate autonomously in the 
genome, either through reverse transcription, or directly from DNA to DNA via 
excision and repair. Up to half of the human genome is composed of such elements, 
and although they are sometimes thought of as "parasitic elements" , somewhat 
like viruses, they clearly play an integral role in shaping genome evolution, and in 
many cases are believed to influence gene function. Unfortunately, they confound 
attempts at reconstructing genomes, by virtue of creating enormous turnover in the 
sequences. At the very least, a complete catalog of such elements will be essential 
for reconstructing ancestral genomes. 

5. Tree reconstruction 

In the previous sections we have been assuming that the phylogenetic tree for 
the species under consideration is known. This assumption is, unfortunately, rarely 
justifiable. Molecular based phylogenies may not conclusively determine certain 
branchings in a tree, and fossil-based phylogenies tend to have low resolution. We 
mention two best-case examples where despite substantial work there is still some 
disagreement as to the actual phylogeny. In vertebrates, molecular techniques are 
not in agreement with other methods used for the rodents (the so-called "rodent 
problem" [2j 152] ). and in Drosophila there is disagreement about the splits among 
Drosophila erecta, yakuba and melanogaster |42l . In other branches of the tree 
of life, the situation can be that nothing at all is known about the details of the 
phylogeny. Thus, phylogenetic trees must be inferred, and the topology of the trees 
has a direct bearing on the reconstructed ancestral sequences |46j . 

We begin by describing a likelihood-based strategy that can be followed, but 
that is computationally infeasible in practice: for each tree, the probability of 
the known sequences may be computed for a specific evolutionary model, and one 
can select the tree/evolutionary model combination with maximal likelihood. This 
is known as the maximum likelihood approach to phylogeny reconstruction. The 
reason the algorithm proposed above is computationally intractable for trees with 
many leaves is that the number of binary trees with k leaves is (2/c — 5)!!. Moreover, 
the problem of finding the maximum likelihood branch lengths for a fixed tree is 
very difficult |25j . The field of phylogenetics research is very active and it is only 
recently that the following result was published, quantifying the difficulty of the 
tree reconstruction problem: 

Theorem 5.1 (|12j). Given a set of binary strings, all of the same length, and 
a negative number L, it is NP-hard to determine whether there is a tree T such 
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that the log likelihood of the sequences for the tree T with optimal branch lengths is 
greater than L. 

On the positive side, there are theoretically sound approaches to phylogenetic 
reconstruction that are also practical for large datasets. In the context of ancestral 
reconstruction there are often many taxa to be considered, and the favored approach 
is neighbor joining |47| (sometimes other closely related distance-based algorithms 
are used [8j). The neighbor joining algorithm takes as input a dissimilarity map on 
a set of taxa X. This is a map i:IxX->I that satisfies 6(i,j) = 6(j,i) and 
5(i,i) = 0. The quantities 5(i,j) are maximum likelihood estimates of the branch 
length (see !2.4j) between every pair of taxa. The algorithm is: 

(1) Given a dissimilarity map 5, compute the Q-criterion 

Q s (i, j) = (n - 2)8(i, j) - 5(i, k) - ^ S(j, k). 

Then select a pair a, b that minimize Q$ as motivated by the following 
theorem: 

Theorem 5.2 f [47] ) . Let 5t be the tree metric corresponding to the 
tree T. The pair a, b that minimizes Qs T (i,j) is a cherry in the tree. 

(2) If there are more than three taxa, replace the putative cherry a and b 
with a leaf j b, and construct a new dissimilarity map where 5{i,j a b) = 
\{8{i, a) + 5(i, b)). This is called the reduction step. 

(3) Repeat (1) and (2) until there are three taxa. 

Neighbor-joining is fast: existing implementations run in 0(n 3 ) where n is the 
number of taxa, and it has been observed (empirically) to produce good results 
[29) . However, despite the ubiquitous use of the algorithm, very little about it has 
been understood until recently. Exciting new results include: 

• The development of fast neighbor-joining which achieves an optimal run 
time of 0(n 2 ) [Tf] . 

• A uniqueness theorem for the algorithm [7]. 

• An answer to the question of "what does neighbor-joining optimize"? [23j . 

• An answer to the question of "when (and why) does neighbor-joining 
work"? [35]. 

Together, these results provide new insight into the algorithm, and open up the 
possibility for significant improvements in accuracy. 

Returning to maximum likelihood phylogenetic reconstruction, recent results 
also show that it is possible to efficiently reconstruct the topology of trees (with 
high probability) using likelihood models of the type described in Section 2 given 
only polylogarithmic quantities of data. 

Theorem 5.3 ( [13] ). Under the Cavender-Farris model, there is a constructive 
algorithm that reconstructs almost all trees on k leaves with sequences of length 
k = 0(poly(logk)). 

A recent related result [34] provides an alternative analysis that quantitatively 
couples the reconstruction problem with the ancestral reconstruction problem. In- 
deed, it appears that ancestral sequence reconstruction and tree reconstruction are 
far more related than originally thought. 
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6. Open problems and discussion 

A recent survey article [45] proposes that "an integrated, multi-disciplinary 
approach is needed in order to make progress on ancestral genome reconstruction" . 
We agree with this point of view, and in the spirit of the proposal offer an invitation 
to mathematicians, statisticians and computer scientists by highlighting some open 
problems that may form a starting point for research and collaboration. We focus 
on problems important for biology, but many of the questions also lead to interesting 
mathematics |51j. 

The Cavender-Farris model introduced in Section 2 is the simplest example of 
an evolutionary model. A central problem in genomics is to find appropriate models 
that effectively capture the mechanisms by which sequences changes, but that are 
also useful for inference. An important class of models that have been proposed 
are phylogenetic hidden Markov models 1331 148j . 

Problem 6.1 (Phylogenetic hidden Markov models). Find efficient algorithms 
for inference with phylogenetic hidden Markov models. 

See [27] for an introduction to this problem and some first steps exploring the 
use of variational methods and other graphical model techniques. 

In Section 4, we raise the issue of alignability of sequences, and the implications 
for ancestral genome reconstruction. There are other approaches to addressing the 
problem of alignability: In [14] , we show that the choice of parameters is crucial for 
correctly identifying homologous transcription factor binding sites. The methods 
used are those of parametric alignment, which is a geometric approach to studying 
the dependence of optimal alignments on parameters. We propose the following 
problem based on |22j . 

Problem 6.2 (Parametric ancestral reconstruction). Develop polyhedral algo- 
rithms for the ancestral reconstruction problem. In particular, what implications 
does the Jukes- Cantor function of [14] . have for ancestral reconstruction? 

In a similar vein, and inspired by our observation in Section 4 that the number 
of indels in introns may preclude ancestral reconstruction, we ask: 

Problem 6.3 (Indel saturation). What are the limits on ancestral reconstruc- 
tion as determined by indel rates and distances? 

In the field of phylogenetics, we offer two problems chosen for their specific 
relevance to the ancestral genome reconstruction problem. For readers interested 
in algebraic geometry, we mention [19] for further mathematical problems. 

Problem 6.4 (Tree reconstruction and alignment). Find efficient algorithms 
for reconstructing a tree under the tree alignment model. 

The next problem is especially important for ancestral reconstruction of bac- 
terial genomes where there is a lot of horizontal transfer: 

Problem 6.5 (Consistency theorems for networks). Extend the robustness 
analysis of [35 to generalizations of the neighbor joining algorithm that project 
dissimilarity maps onto phylogenetic networks 9, 26| rather than trees. 

The reconstruction of ancestral genomes involves more than the inference of 
ancestral sequences based on groups of homologous extant nucleotides. The or- 
der of sequences is also important, and an important component of whole genome 
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reconstruction is the inference of the ancestral order of genomic segments. The 
problem is closely related to the whole genome alignment problem [15] . In this 
regard, Definition 13.11 is too restrictive. For example, it does not allow for homol- 
ogy relationships where there have been rearrangements, inversions, and segmental 
duplications. In our opinion, a major problem that needs to be solved, where a 
close collaboration between biologists and mathematicians is necessary is: 

Problem 6.6 (What is an alignment?). Provide a definition for whole genome 
alignment that is based on a comprehensive biological definition of homology. 

There are formulations of alignment different from 13.11 but they arc also too 
restrictive. Nevertheless, we mention one important approach to inference of an- 
cestral order: 

Definition 6.7. A reversal alignment between two genomes is a signed per- 
mutation. 

For example, the signed permutation 
(6.1) 17-6KT98-211-3-54 

is a reversal alignment between the human and mouse X chromosomes (this is an 
example from [5]). This means that there is a division of the human X chromosome 
into 11 pieces (equivalently 11 breakpoints) such that if they are labeled, in order, 
12 3 4 5 6 7 8 91011, then they appear in the mouse in the order (|6.1j) . Note that 
the negative signs specify the direction of the segments, with a negative indicating 
reversal and complementation. A reversal operation involves reversing the order 
of a segment of a signed permutation, and flipping the signs. For example, by 
a reversal of (|6.1|) can consist of changing the segment 11~ 3~ 5 4 to 4~ 5~ 3 11. 
Biologically, reversals correspond to rearrangement events. The reconstruction of 
ancestral order is equivalent to 

Problem 6.8 (The median problem). Given a phylogenetic tree T, a distance 
measure between signed permutations, and signed permutations labeling the leaves 
of T, find signed permutations 7r' where i ranges over all the internal nodes of T 
such that the sum of the distances between permutations adjacent in the tree is 
minimized. 

The case where the distance measure is the reversal distance is already interest- 
ing and difficult, but in practice more complex distance measures need to be used 
(that allow for multiple chromosomes and other events, such as duplications). For 
more on the problem see [6], I16L 155] . 

We conclude by noting that although we have not discussed it in this paper, 
ancestral reconstructions of proteins can be sequenced and tested for their physio- 
chemical properties (e.g., [281 154] ). Thus, ancestral reconstructions are not merely 
theoretical exercises. This exciting aspect of the field continues to be developed, 
and will hopefully lead to tests not just of genes, but also of ancestral regulatory 
elements and larger genome segments. 
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